Abstract

We study the extent to which online social networks can be
connected to knowledge bases. The problem is referred to as
learning social knowledge graphs. We propose a multi-modal
Bayesian embedding model, GenVector, to learn latent topics
that generate word embeddings and network embeddings
simultaneously. GenVector leverages large-scale unlabeled data
with embeddings and represents data of two modalities (i.e.,
social network users and knowledge concepts) in a shared latent
topic space. Experiments on three datasets show that the
proposed method clearly outperforms state-of-the-art methods.
We then deploy the method on AMiner, an online academic search
system, to connect a network of 38,049,189 researchers with a
knowledge base of 35,415,011 concepts. Our method significantly
decreases the error rate of learning social knowledge graphs in
an online A/B test with live users.


1 Introduction 


With the rapid development of online social networks,
understanding user behaviors and network dynamics has become an
important yet challenging issue for social network mining.
Quite a few research works have addressed this problem. For
instance, Wang et al. [2014] developed an approach to infer
topic-based diffusion networks by considering different
cascaded processes. Han and Tang [2015] proposed a
probabilistic framework to model social links, communities,
user attributes, roles and behaviors in a unified manner.
Sudhof et al. [2014] developed a theory of conditional
dependencies between human emotional states and implemented the
theory using conditional random fields (CRFs). However, none of
the aforementioned works considers linking social content to
universal knowledge bases, and thus the mining results can only
be applied to a specific social network. Tang et al. [2013]
proposed the SOCINST model to extract entity information by
incorporating both social context and domain knowledge.
However, users are not directly linked to knowledge bases,
which limits deeper user understanding.


To bridge the gap between social networks and knowledge bases,
we formalize a novel problem of learning social knowledge
graphs. More specifically, given a social network, a knowledge
base, and text posted by users on social networks, we aim to
link each social network user to a given number of knowledge
concepts. For example, in an academic social network, the
problem can be defined as linking each researcher to a number
of knowledge concepts in Wikipedia that reflect the
researcher's interests. Learning social knowledge graphs has
potential applications in user modeling, recommendation, and
knowledge-based search [Sigurbjörnsson and Van Zwol, 2008;
Kasneci et al., 2008; Tang et al., 2008].

Multi-modal topic models, such as author-topic models
[Rosen-Zvi et al., 2004] and Corr-LDA [Blei and Jordan, 2003],
can be extended to model the two modalities in our problem,
i.e., social network users and knowledge concepts. However,
topic models are usually trained on text, and it is difficult
to leverage information in knowledge bases and the structure of
social networks. Recent advances in embeddings [Mikolov et al.,
2013; Bordes et al., 2013; Perozzi et al., 2014] learn
embeddings for words, knowledge concepts, and nodes in
networks, which capture continuous semantics from unlabeled
data. However, these embedding techniques do not model
multi-modal correlation and thus cannot be directly applied to
multi-modal settings.

We propose GenVector, a multi-modal Bayesian embedding model,
to learn social knowledge graphs. GenVector uses latent
discrete topic variables to generate continuous word embeddings
and network-based user embeddings. The model combines the
advantages of topic models and word embeddings, and is able to
model multi-modal data and continuous semantics. We present an
effective learning algorithm to iteratively update the latent
topics and the embeddings.

We collect three datasets for evaluation. Experiments show
that GenVector clearly outperforms state-of-the-art methods.
We also deploy GenVector in an online academic search system
to connect a network of 38,049,189 researchers with a knowledge
base of 35,415,011 concepts. We carefully design an online A/B
test to compare the proposed model with the original algorithm
of the system. Results show that GenVector significantly
decreases the error rate, by 67%. Our main contributions are as
follows:


• We formalize the problem of learning social knowledge
graphs with the goal of connecting large-scale social
networks with open knowledge bases.

• We propose GenVector, a novel multi-modal Bayesian
embedding model that models multi-modal embeddings
with a shared latent topic space.

• We show that GenVector outperforms state-of-the-art
methods in the task of learning social knowledge graphs
on three datasets and significantly decreases the error
rate in an online A/B test on a real system.



2 Problem Formulation 

The input of our problem includes a social network, a
knowledge base, and text posted by users of the social network.
The social network is denoted as $\mathcal{G}^r =
(\mathcal{V}^r, \mathcal{E}^r)$, where $\mathcal{V}^r$ is a set
of users and $\mathcal{E}^r$ is a set of edges between the
users, either directed or undirected. The knowledge base is
denoted as $\mathcal{G}^k = (\mathcal{V}^k, \mathcal{C})$, where
$\mathcal{V}^k$ is a set of knowledge concepts and $\mathcal{C}$
denotes text associated with or facts between the concepts.
One example of the knowledge base is Wikipedia, where concepts
are entities proposed by users and text information of a
concept corresponds to the article associated with the entity.
In general, our problem setting is applicable to any specific
$\mathcal{C}$ as long as we can learn knowledge concept
embeddings from $\mathcal{C}$. Social text posted by users is
denoted as $\mathcal{D}$. Given a user $u \in \mathcal{V}^r$,
$d_u \in \mathcal{D}$ denotes a document of all text posted by
$u$. Each user $u$ has only one document $d_u$.

The output of the problem is a social knowledge graph
$\mathcal{G} = (\mathcal{V}^r, \mathcal{V}^k, \mathcal{P})$.
More specifically, given a user $u \in \mathcal{V}^r$,
$\mathcal{P}_u$ is a ranked list of top-$k$ knowledge concepts
in $\mathcal{V}^k$, where the order indicates the relatedness
to user $u$. For example, in an academic social network, the
algorithm outputs the top-$k$ research interests of each
researcher $u$ as a ranked list $\mathcal{P}_u$.

There are two modalities in this problem: social network users
and knowledge concepts. Previous problem settings usually
consider only one of the two modalities. For example, social
tag prediction [Heymann et al., 2008] aims to assign tags to
social network users without linking tags to knowledge bases.
Conversely, entity recognition in social context [Tang et al.,
2013] extracts entities from social text without directly
linking entities to users. In this sense, the problem of
learning social knowledge graphs is technically challenging
because we need to leverage information about both users and
concepts.

3 Model Framework 

We propose GenVector, a multi-modal Bayesian embedding model
for learning social knowledge graphs. To jointly model multiple
modalities, GenVector learns a shared latent topic space to
generate network-based user embeddings and text-based concept
embeddings in two different embedding spaces.

GenVector takes pretrained knowledge concept embeddings and
user embeddings as input, where the pretrained embeddings
encode information from the knowledge base $\mathcal{G}^k$ and
the social network $\mathcal{G}^r$. In other words, embeddings
are given as observed variables in our model. We use the
Skip-gram model [Mikolov et al., 2013] to learn knowledge
concept embeddings, and use DeepWalk [Perozzi et al., 2014] to
learn network-based user embeddings.

Figure 1: GenVector: a multi-modal Bayesian embedding model.
Each document $d_u$ contains all text posted by a social
network user $u$ (each user has only one document). Embeddings
are observed variables (dark circles).
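As a rough illustration of how network-based user embeddings
can be pretrained, the sketch below generates truncated random
walks over a toy co-authorship graph, which is the first stage
of DeepWalk; the walks would then be passed to a Skip-gram
trainer (not shown). The function name `random_walks`, the
adjacency-list format, the toy graph and the walk parameters
are assumptions made for illustration, not details from the
paper.

```python
import random

def random_walks(adj, walks_per_node=2, walk_length=5, seed=0):
    """Generate truncated uniform random walks, DeepWalk-style.

    adj maps each node to a list of neighbours (illustrative format).
    """
    rng = random.Random(seed)
    nodes = list(adj)
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)  # new random order of start nodes per pass
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adj[walk[-1]]
                if not neighbours:  # isolated node: stop the walk early
                    break
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks

# Toy undirected co-authorship graph with hypothetical user ids.
toy_graph = {"u1": ["u2", "u3"], "u2": ["u1"], "u3": ["u1"], "u4": []}
walks = random_walks(toy_graph)
```

Each walk plays the role of a sentence whose tokens are user
ids, so users who co-occur on short walks end up with similar
embeddings.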

3.1 Generative Process 

GenVector is a generative model, and the generative process
is illustrated in Figure 1, where we plot the graphical
representation with $D$ documents and $T$ topics. Each document
$d_u$ corresponds to a user $u$ and contains all text posted by
$u$.

The generative process is as follows:

1. For each topic $t$, and for each dimension

   (a) Draw $\mu^r_t, \lambda^r_t$ from NormalGamma($\tau^r$)

   (b) Draw $\mu^k_t, \lambda^k_t$ from NormalGamma($\tau^k$)

2. For each user $u$

   (a) Draw a multinomial distribution $\theta$ from Dir($\alpha$)

   (b) For each knowledge concept $w$ in $d_u$

       i. Draw a topic $z$ from Multi($\theta$)

       ii. For each dimension of the embedding of $w$, draw
           $f^k$ from $\mathcal{N}(\mu^k_z, \lambda^k_z)$

   (c) Draw a topic $y$ uniformly from all $z$'s in $d_u$

   (d) For each dimension of the embedding of user $u$,
       draw $f^r$ from $\mathcal{N}(\mu^r_y, \lambda^r_y)$

where notations with superscript $k$ denote parameters defined
for knowledge concepts and notations with superscript $r$ those
for network users; $\tau$ is the hyperparameter of the normal
Gamma distribution; $\mu$ and $\lambda$ are the mean and
precision of the Gaussian distribution; $\alpha$ is the
hyperparameter of the Dirichlet distribution; $\theta_u$ is the
multinomial topic distribution of document $d_u$ (or user $u$);
$z_{um}$ is the topic of the $m$-th knowledge concept in
document $d_u$; $y_u$ is the topic of user $u$;
$\mathcal{N}(\cdot)$ denotes the Gaussian distribution.
Similarly, $f^k_{um}$ are knowledge concept embeddings, while
$f^r_u$ are network-based user embeddings. We drop the
subscripts or superscripts when there is no ambiguity.
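To make the generative story concrete, the following sketch
samples one synthetic user from it using only the Python
standard library. It is a minimal illustration under stated
assumptions: a symmetric Dirichlet prior, a normal-Gamma prior
parameterized as $\lambda \sim \mathrm{Gamma}(\alpha_0,
\text{rate } \beta_0)$ and $\mu \sim \mathcal{N}(\mu_0,
(\kappa_0\lambda)^{-1})$, and toy sizes and helper names
(`sample_normal_gamma`, `generate_user`) that do not come from
the paper.

```python
import math
import random

def sample_normal_gamma(rng, mu0, kappa0, alpha0, beta0):
    """Draw (mean, precision) for one topic and one dimension."""
    lam = rng.gammavariate(alpha0, 1.0 / beta0)  # arguments are shape, scale
    mu = rng.gauss(mu0, 1.0 / math.sqrt(kappa0 * lam))
    return mu, lam

def sample_dirichlet(rng, alpha, T):
    """Symmetric Dirichlet draw via normalised Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(T)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_user(rng, n_concepts, topics_k, topics_r, alpha):
    """topics_k[t][e] and topics_r[t][e] hold (mu, lam) pairs."""
    T = len(topics_k)
    theta = sample_dirichlet(rng, alpha, T)               # step 2(a)
    z, f_k = [], []
    for _ in range(n_concepts):                           # step 2(b)
        t = rng.choices(range(T), weights=theta)[0]
        z.append(t)
        f_k.append([rng.gauss(mu, 1.0 / math.sqrt(lam))
                    for mu, lam in topics_k[t]])
    y = rng.choice(z)                                     # step 2(c)
    f_r = [rng.gauss(mu, 1.0 / math.sqrt(lam))            # step 2(d)
           for mu, lam in topics_r[y]]
    return z, y, f_k, f_r

rng = random.Random(0)
T, E_k, E_r = 3, 4, 2  # toy numbers of topics and embedding dimensions
prior = dict(mu0=0.0, kappa0=1.0, alpha0=2.0, beta0=1.0)
topics_k = [[sample_normal_gamma(rng, **prior) for _ in range(E_k)]
            for _ in range(T)]                            # step 1
topics_r = [[sample_normal_gamma(rng, **prior) for _ in range(E_r)]
            for _ in range(T)]
z, y, f_k, f_r = generate_user(rng, 5, topics_k, topics_r, alpha=0.25)
```

Because `y` is drawn from the list `z` itself, a topic that
generated many of the user's concepts is proportionally more
likely to generate the user embedding, which is what ties the
two modalities together.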

Note that although it is possible to draw the embeddings from
a multivariate Gaussian distribution, we instead draw each
dimension separately from a univariate Gaussian distribution,
because this is more computationally efficient and also
performs well in practice. Although developed independently,
our model extends Gaussian LDA [Das et al., 2015] to model
multiple modalities. Similar multi-modal techniques were also
used in Corr-LDA [Blei and Jordan, 2003]. However, different
from Corr-LDA, we generate continuous embeddings in two spaces
and use the normal Gamma distribution as the prior.

Algorithm 1: Model Inference

Input: training data $\mathcal{D}$, hyperparameters $\tau$,
    $\alpha$, initial embeddings $f^r$, $f^k$, burn-in
    iterations $t_b$, max iterations $t_m$, latent topic
    iterations $t_l$, parameter update period $t_p$
Output: latent topics $z$, $y$, model parameters $\lambda$,
    $\mu$, $\theta$, updated embeddings $f^r$, $f^k$

    // Initialization
 1  randomly initialize $z$, $y$
    // Sampling
 2  for $t \leftarrow 1$ to $t_m$ do
 3      for $t' \leftarrow 1$ to $t_l$ do
 4          foreach latent topic $z$ do
 5              Draw $z$ according to Eq. (2)
 6          foreach latent topic $y$ do
 7              Draw $y$ according to Eq. (1)
 8          if $t_p$ iterations since last read-out and $t > t_b$ then
 9              Read out parameters according to Eq. (3)
10      Average all read-outs
11      if $t > t_b$ then
12          Update the embeddings according to Eq. (5)
13  return $z$, $y$, $\lambda^k$, $\lambda^r$, $\mu^k$, $\mu^r$, $\theta$, $f^r$, $f^k$

3.2 Inference 

We employ collapsed Gibbs sampling [Griffiths, 2002] for
inference. The basic idea of collapsed Gibbs sampling is to
integrate out the model parameters and then perform Gibbs
sampling. Due to space limitations, we directly give the
conditional probabilities of the latent variables.


$$p(y_u = t \mid y_{-u}, z, f^r, f^k) \propto (n_u^t + l) \prod_{e=1}^{E^r} G'(f^r, y, t, e, \tau^r, u) \tag{1}$$

$$p(z_{um} = t \mid z_{-um}, y, f^r, f^k) \propto (n_u^{y_u} + l)(n_u^t + \alpha_t) \prod_{e=1}^{E^k} G'(f^k, z, t, e, \tau^k, um) \tag{2}$$

where the subscript $-u$ means ruling out dimension $u$ of a
vector; $n_u^t$ is the number of knowledge concepts assigned to
topic $t$ in document $d_u$; $E^r$ and $E^k$ are the dimensions
of user embeddings and knowledge concept embeddings; $l$ is the
smoothing parameter of Laplace smoothing [Manning et al.,
2008].

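The function $G'(\cdot)$ is given as follows

$$G'(f, y, t, e, \tau, u) = \frac{\Gamma(\alpha_n)}{\Gamma(\alpha_{n'})} \, \frac{\beta_{n'}^{\alpha_{n'}}}{\beta_n^{\alpha_n}} \left(\frac{\kappa_{n'}}{\kappa_n}\right)^{1/2} \frac{(2\pi)^{-n/2}}{(2\pi)^{-n'/2}}$$

with

$$\alpha_n = \alpha_0 + n/2, \quad \kappa_n = \kappa_0 + n, \quad \mu_n = \frac{\kappa_0 \mu_0 + n\bar{x}}{\kappa_0 + n},$$

$$\beta_n = \beta_0 + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)}$$

where $\tau = \{\alpha_0, \beta_0, \kappa_0, \mu_0\}$ are the
hyperparameters of the normal Gamma distribution; $n$ is the
number of $i$'s with $y_i = t$; $x$ is a vector concatenating
the $e$-th dimension of the $f_i$'s with $y_i = t$; $n' = n - 1$
if $y_u = t$, otherwise $n' = n$; $\bar{x}$ is the mean of all
dimensions of $x$.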
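The quantities above can be computed in log space to avoid
overflow in the Gamma functions. The sketch below assumes that
$G'$ is the ratio of two normal-Gamma marginal likelihoods, one
over the $n$ values assigned to the topic and one over the $n'$
values that remain after removing the current item; the helper
names `posterior` and `log_g_prime` and the toy prior are
illustrative choices rather than the authors' implementation.

```python
import math

def posterior(xs, mu0, kappa0, alpha0, beta0):
    """Posterior normal-Gamma hyperparameters after observing xs."""
    n = len(xs)
    xbar = sum(xs) / n if n else 0.0
    ss = sum((x - xbar) ** 2 for x in xs)
    alpha_n = alpha0 + n / 2
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * kappa_n)
    return alpha_n, kappa_n, mu_n, beta_n

def log_g_prime(xs, xs_prime, **prior):
    """log G' for one dimension: xs has n values, xs_prime has n' values."""
    a_n, k_n, _, b_n = posterior(xs, **prior)
    a_p, k_p, _, b_p = posterior(xs_prime, **prior)
    n, n_p = len(xs), len(xs_prime)
    return (math.lgamma(a_n) - math.lgamma(a_p)
            + a_p * math.log(b_p) - a_n * math.log(b_n)
            + 0.5 * (math.log(k_p) - math.log(k_n))
            - 0.5 * (n - n_p) * math.log(2 * math.pi))

prior = dict(mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0)
# Density of a first observation x = 0 under the prior alone (n = 1, n' = 0).
g_single = math.exp(log_g_prime([0.0], [], **prior))
```

With $n' = n - 1$ the ratio is the predictive density of the
held-out value given the others, which is why it can be used
directly as the likelihood factor in the Gibbs updates.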

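By taking the expectation of the posterior probabilities, we
update the model parameters by

$$\theta_u^t = \frac{n_u^t + \alpha_t}{\sum_{t'=1}^{T} (n_u^{t'} + \alpha_{t'})}, \quad \lambda_t = \frac{\alpha_0 + n/2}{\beta_0 + \frac{1}{2}\sum_{i=1}^{n}(x_i - \bar{x})^2 + \frac{\kappa_0 n (\bar{x} - \mu_0)^2}{2(\kappa_0 + n)}}, \quad \mu_t = \frac{\kappa_0 \mu_0 + n\bar{x}}{\kappa_0 + n} \tag{3}$$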
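A parameter read-out following Eq. (3) might look like the
sketch below, which reads the precision as the posterior mean
$\alpha_n / \beta_n$ and the mean as $\mu_n$. The function
names are hypothetical, and a symmetric Dirichlet
hyperparameter (a single scalar `alpha`) is assumed for
brevity.

```python
def read_out_theta(counts, alpha):
    """Topic distribution of one user from per-topic concept counts."""
    total = sum(c + alpha for c in counts)
    return [(c + alpha) / total for c in counts]

def read_out_gaussian(xs, mu0, kappa0, alpha0, beta0):
    """Posterior-mean estimates (mu_t, lambda_t) for one topic and dimension."""
    n = len(xs)
    xbar = sum(xs) / n if n else 0.0
    ss = sum((x - xbar) ** 2 for x in xs)
    mu_t = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (xbar - mu0) ** 2 / (2 * (kappa0 + n))
    lam_t = (alpha0 + n / 2) / beta_n
    return mu_t, lam_t

theta = read_out_theta([3, 1, 0], alpha=0.25)
mu_t, lam_t = read_out_gaussian([1.0, 3.0], mu0=0.0, kappa0=1.0,
                                alpha0=1.0, beta0=1.0)
```

In Algorithm 1 these read-outs are taken every $t_p$ iterations
after burn-in and then averaged, which reduces the variance of
any single Gibbs sample.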

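In Gaussian LDA [Das et al., 2015], the word embeddings were
kept fixed during inference. Unlike their approach, in our
model, we update the embeddings during inference to adapt to
different specific problems. Let $W$ be the number of knowledge
concepts. We write the log likelihood of the data given the
model parameters as

$$\mathcal{L} = \sum_{u=1}^{D}\sum_{t=1}^{T}\sum_{e=1}^{E^r} -\frac{\lambda^r_{te}}{2}\,(f^r_{ue} - \mu^r_{te})^2 + \sum_{w=1}^{W}\sum_{t=1}^{T}\sum_{e=1}^{E^k} n_w^t \left(-\frac{\lambda^k_{te}}{2}\right)(f^k_{we} - \mu^k_{te})^2 \tag{4}$$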

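We employ gradient ascent to maximize the log likelihood by
updating the embeddings $f^r$ and $f^k$. The gradients are
computed as

$$\frac{\partial \mathcal{L}}{\partial f^r_{ue}} = \sum_{t=1}^{T} -\lambda^r_{te}\,(f^r_{ue} - \mu^r_{te}), \qquad \frac{\partial \mathcal{L}}{\partial f^k_{we}} = \sum_{t=1}^{T} n_w^t\,(-\lambda^k_{te})(f^k_{we} - \mu^k_{te}) \tag{5}$$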
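A single gradient-ascent step on one concept embedding can be
sketched as follows. The sketch assumes the count-weighted form
of the concept gradient in Eq. (5), with `counts_w[t]` standing
in for $n_w^t$; the learning rate `lr` and the function names
are illustrative assumptions, since the text does not specify a
step size.

```python
def grad_concept(f_w, counts_w, mu, lam):
    """Gradient of the concept term with respect to one embedding f_w.

    mu[t][e] and lam[t][e] are the topic means and precisions.
    """
    return [
        sum(counts_w[t] * (-lam[t][e]) * (f_w[e] - mu[t][e])
            for t in range(len(mu)))
        for e in range(len(f_w))
    ]

def ascent_step(f_w, grad, lr=0.01):
    """Move the embedding a small step along the gradient."""
    return [f + lr * g for f, g in zip(f_w, grad)]

f_w = [1.0, 0.0]
counts_w = [2, 0]                  # concept assigned twice to topic 0
mu = [[0.0, 0.0], [5.0, 5.0]]
lam = [[1.0, 1.0], [1.0, 1.0]]
g = grad_concept(f_w, counts_w, mu, lam)
f_new = ascent_step(f_w, g, lr=0.1)
```

The update pulls the embedding toward the means of the topics
it is assigned to, weighted by precision, so tightly
concentrated topics exert a stronger pull.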

The inference procedure is summarized in Algorithm 1.
Following [Heinrich, 2005], we set a burn-in period of $t_b$
iterations, during which we do not update embeddings or read
out parameters. Similar to [Bezdek and Hathaway, 2003], we
employ alternating optimization for inference. More
specifically, we first fix the embeddings to sample the topics
and infer the model parameters (cf. Lines 3-10 of Algorithm 1).
After a number of iterations, we fix the topics and parameters,
and use gradient ascent to update the embeddings (cf. Lines
11-12 of Algorithm 1). We repeat the procedure for a given
number of iterations.

3.3 Prediction 

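Given a user $u$ and a knowledge concept $w$, let $g_{uw}$
denote whether $y_u$ is drawn from $z_w$. Conditioned on $u$
and the model parameters, we compute the joint probability of
$g_{uw} = 1$ and generating the embedding $f^k_w$,

$$\begin{aligned}
p(g_{uw} = 1, f^k_w \mid f^r_u)
&\propto \sum_{t=1}^{T} p(z_w = t)\, p(g_{uw} = 1)\, p(f^r_u \mid y_u = t)\, p(f^k_w \mid z_w = t) \\
&\propto \sum_{t=1}^{T} \theta_u^t\, (n_u^t + l)\, \mathcal{N}(f^r_u \mid \mu^r_t, \lambda^r_t)\, \mathcal{N}(f^k_w \mid \mu^k_t, \lambda^k_t)
\end{aligned} \tag{6}$$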

































Table 1: Data Statistics

# Social network users      38,049,189
# Publications              74,050,920
# Knowledge concepts        35,415,011
Corpus size in bytes        20,552,544,886


For each user $u$, we rank the interacted knowledge concepts
$w$ according to Eq. (6) to obtain $\mathcal{P}_u$. In this
way, we construct a social knowledge graph via learning the
multi-modal Bayesian embedding model. Note that although it is
possible to rank the knowledge concepts by deriving
$p(f^k_w \mid f^r_u)$, our preliminary experiments show that
using Eq. (6) gives better results.
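The ranking score of Eq. (6) multiplies many small Gaussian
densities, so a practical implementation would likely work in
log space. The sketch below is one way to do this with a
log-sum-exp over topics; it treats each embedding dimension as
an independent univariate Gaussian with precision $\lambda$, as
in the generative process. The function names, the toy
parameters and the default smoothing value are assumptions for
illustration.

```python
import math

def log_gauss(x, mu, lam):
    """Log density of a univariate Gaussian with mean mu and precision lam."""
    return 0.5 * math.log(lam / (2 * math.pi)) - 0.5 * lam * (x - mu) ** 2

def log_score(theta_u, counts_u, f_r, f_k, topics_r, topics_k, smooth=1.0):
    """Log of the unnormalised score in Eq. (6) for one (user, concept) pair."""
    terms = []
    for t, (th, n) in enumerate(zip(theta_u, counts_u)):
        lp = math.log(th) + math.log(n + smooth)
        lp += sum(log_gauss(x, mu, lam) for x, (mu, lam) in zip(f_r, topics_r[t]))
        lp += sum(log_gauss(x, mu, lam) for x, (mu, lam) in zip(f_k, topics_k[t]))
        terms.append(lp)
    m = max(terms)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in terms))

# Toy setup: two topics, one-dimensional embeddings, (mu, lam) per dimension.
topics_r = [[(0.0, 1.0)], [(5.0, 1.0)]]
topics_k = [[(0.0, 1.0)], [(5.0, 1.0)]]
concepts = {"a": [0.0], "b": [5.0]}
ranked = sorted(
    concepts,
    key=lambda w: log_score([0.9, 0.1], [4, 1], [0.1], concepts[w],
                            topics_r, topics_k),
    reverse=True,
)
```

In this toy example the user embedding sits near topic 0, so
the concept whose embedding also lies near topic 0 is ranked
first.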


4 Experiments 

In this section, we perform a series of experiments to evaluate
the proposed methods. We compare our models with
state-of-the-art models on three datasets, and also design an
online test on our system to demonstrate the effectiveness of
our method.


4.1 Data and Evaluation 

We deploy our algorithm and run the experiments on AMiner¹, an
online academic search system [Tang et al., 2008]. The academic
social network $\mathcal{G}^r$ is constructed by viewing each
researcher as a user, and undirected edges represent
co-authorships between researchers. There may be multiple edges
between a pair of researchers if they collaborate multiple
times. We use the publicly available English Wikipedia as the
knowledge base $\mathcal{G}^k$. Each “category” or “page” in
Wikipedia is viewed as a knowledge concept. We use the
full-text Wikipedia corpus² as the text information
$\mathcal{C}$ to learn the knowledge concept embeddings. Social
text $\mathcal{D}$ is derived from publications, where document
$d_u$ for researcher $u$ contains all publications authored by
$u$. If a publication has multiple authors, the publication is
repeated for each researcher in their corresponding documents.
The basic statistics are shown in Table 1. We compare the
following methods.

GenVector is our model proposed in Section 3. We empirically
set $\mu_0 = 0$, $\kappa_0 = 1e-5$, $\beta_0 = 1$,
$\alpha_0 = 1e3$, $T = 200$, $\alpha = 0.25$.

GenVector-E is a variant of GenVector that does not update the
embeddings in Line 12 of Algorithm 1. We compare GenVector-E
with GenVector to evaluate the benefit of the embedding update.

Sys-Base is the original algorithm adopted by our system.
Sys-Base first extracts key terms using a state-of-the-art
rule-based NLP extraction algorithm [Mundy and Thornthwaite,
2007], and then sorts the key terms by frequency.

CountKG extracts knowledge concepts from social text
$\mathcal{D}$ by referring to the knowledge concept set
$\mathcal{V}^k$, and ranks the concepts by appearance
frequency.


Author-Topic learns an author-topic model [Rosen-Zvi et al.,
2004] and ranks the knowledge concepts by
$\sum_{t=1}^{T} p(w \mid t)\, p(t \mid u)$, where $t$, $u$, $w$
denote topic, user and knowledge concept respectively. We set
$T = 200$, $\alpha = 0.25$.

¹ https://aminer.org/
² https://dumps.wikimedia.org/enwiki/latest/

Table 2: Precision@5 of Homepage Matching

Method          Precision@5
GenVector       78.1003%
GenVector-E     77.8548%
Sys-Base        73.8189%
Author-Topic    74.4397%
NTN             65.8911%
CountKG         54.4823%

NTN is a neural tensor network [Socher et al., 2013] that
takes $f^k_w$ and $f^r_u$ as the input vectors, and outputs the
probability of $u$ matching $w$. We perform cross validation to
set the weighting factor $\lambda = 1e-2$ and the slice size
$k = 4$.
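The two simplest baselines above can be sketched in a few
lines. The block below is a hedged illustration, not the
system's code: `count_kg` approximates concept extraction by
case-insensitive substring counting, which is an assumption
(the actual matching procedure is not described), and
`author_topic_score` evaluates the Author-Topic ranking score
for one user-concept pair from hypothetical probability
vectors.

```python
from collections import Counter

def count_kg(document, concept_set, top_k=5):
    """Rank known concepts by how often they appear in a user's document."""
    text = document.lower()
    freq = Counter({c: text.count(c.lower()) for c in concept_set})
    return [c for c, n in freq.most_common() if n > 0][:top_k]

def author_topic_score(p_w_given_t, p_t_given_u):
    """Sum over topics of p(w|t) * p(t|u) for one user-concept pair."""
    return sum(pw * pt for pw, pt in zip(p_w_given_t, p_t_given_u))

doc = ("Face recognition and image processing. "
       "Face recognition with deep models.")
concept_set = {"face recognition", "image processing", "gastric cancer"}
top = count_kg(doc, concept_set)
score = author_topic_score([0.2, 0.5], [0.5, 0.5])
```

Frequency-based ranking has no notion of semantic importance,
which is the weakness the embedding-based model is meant to
address.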

It is difficult in practice to directly evaluate the results 
of social knowledge graphs. Instead, we consider two 
strategies—offline evaluation on three data mining tasks and 
an online A/B test with live users. 

4.2 Offline Evaluation 

We collect three datasets for evaluation. We first learn a
social knowledge graph based on the data described in
Section 4.1 (we discard the long-tail users that do not appear
in our evaluation datasets). For each researcher $u$, we treat
$\mathcal{P}_u$ as the research interests of the researcher.
Then we use the collected datasets in the following sections to
evaluate the precision of the research interests.

Homepage Matching 

We crawl 62,127 researcher homepages from the web. After
filtering out those pages that are not informative enough
(# knowledge concepts < 5), we obtain 1,874 homepages. We
manually identify the research interests that are explicitly
specified by the researcher on the homepage, and treat those
research interests as ground truth. We then evaluate different
methods based on the ground truth and report the precision of
the top 5 knowledge concepts. The performances are listed in
Table 2. GenVector outperforms Sys-Base, Author-Topic, and NTN
by 5.8%, 4.9%, and 18.5% respectively, measured as relative
improvement in Precision@5.
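The reported gains are consistent with relative (not absolute)
improvements over each baseline's Precision@5 in Table 2. The
short sketch below shows how Precision@5 can be computed for
one researcher and rechecks that arithmetic; the function names
and the toy ranked list are illustrative.

```python
def precision_at_k(ranked, relevant, k=5):
    """Fraction of the top-k ranked concepts that appear in the ground truth."""
    top = ranked[:k]
    return sum(1 for c in top if c in relevant) / k

def relative_gain(ours, baseline):
    """Relative improvement of `ours` over `baseline`, in percent."""
    return (ours - baseline) / baseline * 100

# Toy example for a single researcher.
p5 = precision_at_k(["a", "b", "c", "d", "e"], {"a", "c", "x"})

# Precision@5 values taken from Table 2.
baselines = {"Sys-Base": 73.8189, "Author-Topic": 74.4397, "NTN": 65.8911}
gains = {name: round(relative_gain(78.1003, p), 1)
         for name, p in baselines.items()}
```

The same calculation applied to the other result tables gives
the relative gains quoted for those experiments.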

By comparing NTN with GenVector, we show that taking the
learned embeddings as input without exploiting the latent topic
structure cannot result in good performance, although NTN is
among the most expressive models given plain vectors as input
[Socher et al., 2013]. NTN does not perform well because it has
no prior knowledge about the underlying structure of data, and
it is thus difficult to learn a mapping from embeddings to a
matching probability. GenVector performs better than
Author-Topic, which indicates that incorporating knowledge
concept embeddings and user embeddings can boost the
performance. In this sense, GenVector successfully leverages
both network structure (by learning the user embeddings) and a
large-scale unlabeled corpus (by learning the knowledge concept
embeddings).

GenVector also significantly outperforms Sys-Base and CountKG.
Sys-Base and CountKG compute the importance of the knowledge
concepts by term frequency. For this reason, the extracted
knowledge concepts are not necessarily semantically important.
Sys-Base is better than CountKG because Sys-Base uses the key
term extraction algorithm [Mundy and Thornthwaite, 2007] to
filter out frequent but unimportant knowledge concepts.

The difference between GenVector-E and GenVector indicates
that updating the embeddings can further improve the
performance of the proposed model. This is because updating the
embeddings to fit the data in specific problems is better than
using general embeddings learned from unlabeled data.

Table 3: Precision@5 of LinkedIn Profile Matching

Method          Precision@5
GenVector       50.4424%
GenVector-E     49.9145%
Author-Topic    47.6106%
NTN             42.0512%
CountKG         46.8376%


LinkedIn Profile Matching

We design another experiment to evaluate the methods based on
the LinkedIn profiles of researchers. We employ the network
linking algorithm COSNET [Zhang et al., 2015] to link the
academic social network on our system to the LinkedIn network.
More specifically, given a researcher on our system, COSNET
finds the corresponding profile on LinkedIn, if any.

We first select the connected pairs with the highest
probabilities given by COSNET, and then manually select the
correct ones. We use the selected pairs as ground truth, e.g.,
A on our system and B on LinkedIn are exactly the same
researcher in the physical world.

Some LinkedIn profiles of researchers have a field named
“skills”, which contains a list of expertise. Once a researcher
accepts endorsements on specific expertise from their friends,
the expertise is appended to the list of “skills”. After
filtering out researchers with fewer than five “skills”, we
obtain a dataset of 113 researchers. We use the list of
“skills” as the ground truth of research interests. We report
the precision of the top 5 research interests in Table 3. Since
some of the “skills” are not necessarily research interests
(e.g., Python), we focus on precision and do not consider
recall-based evaluation metrics.

We can observe from Table 3 that GenVector gives the best
performance. GenVector outperforms CountKG, Author-Topic and
NTN by 7.7%, 5.9%, and 20.0% respectively (sign test over
samples, p < 0.05). Updating the embeddings improves the
performance by 1.1%.


Intruder Detection 

In this experiment, we employ human effort to judge the
quality of the social knowledge graph. Since annotating
research interests is somewhat subjective, we label the
research interests that are clearly not relevant [Liu et al.,
2009], also known as intruder detection [Chang et al., 2009].
In other words, instead of identifying the research interests
of a researcher, we label what are definitely NOT the research
interests of a researcher, e.g., “challenging problem” and
“training set”.

We randomly pick 100 highly cited researchers on our system.
For each researcher, we run different algorithms to output a
ranked list of knowledge concepts. We combine the top 5
knowledge concepts of each algorithm and perform a random
shuffle. The labeler then labels clearly irrelevant research
interests in the given list of knowledge concepts. We report
the error rate of each method in Table 4.

Table 4: Error Rate of Irrelevant Cases

Method          Error Rate
GenVector       1.2%
Sys-Base        18.8%
Author-Topic    1.6%
NTN             7.2%

Figure 2: Questionnaire: Leveraging Collective Intelligence
for Evaluation. The sample questionnaire asks “Does Jiawei Han
have these skills or expertise?”

According to Table 4, GenVector produces fewer irrelevant
knowledge concepts than other methods. This is because
GenVector leverages a large-scale unlabeled corpus to encode
the semantic information into the embeddings, and therefore is
able to link researchers to major research interests.
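The intruder-detection protocol can be mimicked as in the
sketch below: the top-5 lists of all methods are pooled and
shuffled so the labeler cannot tell which method produced a
concept, and each method is then charged for the flagged
concepts in its own list. This is one plausible reading of the
procedure for a single researcher; the method names, concept
lists and function names are hypothetical.

```python
import random

def pooled_candidates(top_lists, k=5, seed=0):
    """Union of every method's top-k concepts, shuffled for blind labeling."""
    pool = sorted({c for ranked in top_lists.values() for c in ranked[:k]})
    random.Random(seed).shuffle(pool)
    return pool

def error_rate(top_lists, flagged, method, k=5):
    """Share of a method's top-k concepts that the labeler flagged."""
    top = top_lists[method][:k]
    return sum(1 for c in top if c in flagged) / len(top)

top_lists = {
    "method_a": ["data mining", "clustering", "training set",
                 "text mining", "web search"],
    "method_b": ["data mining", "clustering", "graph mining",
                 "text mining", "databases"],
}
pool = pooled_candidates(top_lists)
flagged = {"training set"}  # concepts labeled as clearly irrelevant
rates = {m: error_rate(top_lists, flagged, m) for m in top_lists}
```

Averaging such per-researcher rates over all sampled
researchers would give a table-level error rate per method.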

4.3 Online Test 

To further test the performance of our algorithm, we deploy
GenVector on our online system with the full dataset described
in Section 4.1. We leverage collective intelligence by asking
the users to select what they think are the research interests
of the given researcher.

Since Sys-Base is the original algorithm adopted by our
system, we perform an online test by comparing GenVector with
Sys-Base to evaluate the performance gain. For each researcher,
we first compute the top 10 research interests provided by the
two algorithms. Then we randomly select 3 research interests
from each algorithm, and merge the selected research interests
in a random order. When a user visits the profile page of a
researcher, a questionnaire is displayed on top of the profile.
A sample is shown in Figure 2. Users can vote for research
interests that they think are relevant to the given researcher.
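The questionnaire construction described above can be sketched
as follows: three interests are sampled from each algorithm's
top 10 and the six items are shown in random order. The
function name, the method labels attached to each item (kept
only for scoring, not shown to the voter) and the toy interest
lists are assumptions for illustration.

```python
import random

def build_questionnaire(top_a, top_b, per_method=3, top_n=10, seed=None):
    """Sample interests from each algorithm's top-n and merge in random order."""
    rng = random.Random(seed)
    picks = [(c, "GenVector") for c in rng.sample(top_a[:top_n], per_method)]
    picks += [(c, "Sys-Base") for c in rng.sample(top_b[:top_n], per_method)]
    rng.shuffle(picks)  # hide which algorithm produced which interest
    return picks

top_a = [f"a{i}" for i in range(12)]  # hypothetical ranked interests
top_b = [f"b{i}" for i in range(12)]
questionnaire = build_questionnaire(top_a, top_b, seed=1)
```

Items that receive no vote count as errors for the algorithm
that produced them, which yields the per-questionnaire error
rates compared in the online test.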

We collect 110 questionnaires in total, and use them as ground
truth to evaluate the algorithms. The error rates of different
algorithms are shown in Table 5. We can observe that GenVector
decreases the error rate by 67% (from 10.00% to 3.33%).
Moreover, the error rate of GenVector is lower than or equal to
that of Sys-Base for 95.45% of the collected questionnaires
(sign test over samples, p ≪ 0.01).
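A sign test of the kind cited here compares paired outcomes and
asks how likely the observed number of wins would be if both
methods were equally good. The sketch below computes a
one-sided p-value from the binomial distribution with ties
excluded; the win and loss counts are hypothetical, since the
per-questionnaire breakdown (including the number of ties) is
not reported.

```python
from math import comb

def sign_test_p(wins, losses):
    """One-sided sign test: P(X >= wins) for X ~ Binomial(wins + losses, 0.5)."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# Hypothetical counts: 9 pairs favour method A, 1 favours method B.
p_value = sign_test_p(wins=9, losses=1)
```

The test uses only the direction of each paired difference, so
it makes no assumption about the distribution of error rates.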


Table 5: Error Rate of Online Test

Method          Error Rate
GenVector       3.33%
Sys-Base        10.00%

Figure 3: Run Time and Convergence: log-likelihood vs. run
time in seconds.


4.4 Run Time and Convergence 

Figure 3 plots the run time and convergence of GenVector and
GenVector-E. During the burn-in period, GenVector and
GenVector-E perform identically because GenVector does not
update the embeddings during this period. After the burn-in
period, the likelihood of GenVector continues to increase while
that of GenVector-E remains stable. This indicates that by
updating the embeddings, GenVector can better fit the data,
which leads to the better performance shown in previous
sections. The experiments were run on Intel(R) Xeon(R) CPU
E5-4650 0 @ 2.70GHz with 64 threads.

4.5 Case Study 

Table 6 shows the researchers and knowledge concepts within
each topic, output by GenVector and Author-Topic, where each
column corresponds to a topic. As can be seen from Table 6,
Author-Topic identifies several irrelevant concepts (as judged
by humans) such as “integrated circuits” in topic #1, “food
intake” in topic #2, and “in vitro” in topic #3, while
GenVector does not have this problem.


5 Related Work 

Variants of topic models [Hofmann, 1999; Blei et al., 2003]
represent each word as a vector of topic-specific
probabilities. Although Corr-LDA [Blei and Jordan, 2003] and
the author-topic model [Rosen-Zvi et al., 2004] can be used for
multi-modal modeling, these topic models use discrete
representations for observed variables, and are not able to
exploit the continuous semantics of words and authors.


Learning embeddings [Mikolov et al., 2013; Levy and Goldberg,
2014; Perozzi et al., 2014] is effective at modeling continuous
semantics with large-scale unlabeled data, e.g., knowledge
bases and network structure. Neural tensor networks [Socher et
al., 2013] are expressive models for mapping the embeddings to
the prediction targets. However, GenVector can better model
multi-modal data by basing the embeddings on a generative
process from latent topics.

Recently a few research works [Das et al., 2015; Wan et al.,
2012] propose hybrid models to combine the advantages of topic
models and embeddings. Gaussian embedding models [Vilnis and
McCallum, 2015] learn word representations via a Gaussian
generative process to encode the hierarchical structure of
words. However, these models are proposed to address other
issues in semantic modeling, and cannot be directly used for
multi-modal data.

Table 6: Knowledge Concepts and Researchers of Given Topics.
* marks relatively irrelevant concepts.

GenVector

Topic #1                   Topic #2                      Topic #3
query expansion            image processing              hepatocellular carcinoma
concept mining             face recognition              gastric cancer
language modeling          feature extraction            acute lymphoblastic leukemia
information extraction     computer vision               renal cell carcinoma
knowledge extraction       image segmentation            glioblastoma multiforme
entity linking             image analysis                acute myeloid leukemia
language models            feature detection             peripheral blood
named entity recognition   digital image processing      malignant melanoma
document clustering        machine learning algorithms   hepatitis c virus
latent semantic indexing   machine vision                squamous cell carcinoma

Thorsten Joachims          Anil K. Jain                  Keizo Sugimachi
Jian Pei                   Thomas S. Huang               Setsuo Hirohashi
Christopher D. Manning     Peter N. Belhumeur            Masatoshi Makuuchi
Raymond J. Mooney          Azriel Rosenfeld              Morito Monden
Charu C. Aggarwal          Josef Kittler                 Yoshio Yamaoka
William W. Cohen           Shuicheng Yan                 Kunio Okuda
Eugene Charniak            David Zhang                   Yasuni Nakanuma
Kamal Nigam                Xiaoou Tang                   Kendo Kiyosawa
Susan T. Dumais            Roberto Cipolla               Masazumi Tsuneyoshi
T. K. Landauer             David A. Forsyth              Satoru Todo

Author-Topic

Topic #1                     Topic #2                  Topic #3
speech recognition           face recognition          hepatocellular carcinoma
natural language             * food intake             kidney transplantation
* integrated circuits        face detection            cell line
document retrieval           image recognition         differential diagnosis
language models              * atmospheric chemistry   liver tumors
language model               feature extraction        cell lines
* microphone array           statistical learning      squamous cell carcinoma
computational linguistics    discriminant analysis     * in vitro
* semidefinite programming   object tracking           kidney transplant
active learning              * human factors           lymph nodes

James F. Allen               Anil K. Jain              Keizo Sugimachi
Christopher D. Manning       Kevin W. Bowyer           Giuseppe Remuzzi

Learning social knowledge graphs is also related to keyword
extraction. Different from conventional keyword extraction
methods [Liu et al., 2009; Mundy and Thornthwaite, 2007; Matsuo
and Ishizuka, 2004; Rao et al., 2013], our method is based on
topic models and embedding learning.


6 Conclusion 

In this paper, we study the problem of learning social
knowledge graphs. We propose GenVector, a multi-modal Bayesian
embedding model, to jointly incorporate the advantages of topic
models and embeddings. GenVector models the network embeddings
and knowledge concept embeddings in a shared topic space. We
present an effective learning algorithm that alternates between
topic sampling and embedding update. Experiments show that
GenVector outperforms state-of-the-art methods, including topic
models, embedding-based models and keyword extraction based
methods. We deploy the algorithm on a large-scale social
network and decrease the error rate by 67% in an online test.